{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This lab gives an overview of the Nvidia Nsight Tool and steps to profile an application with NVIDIA Nsight™ Systems command line interface with NVTX API. You will learn how to integrate NVTX markers in your application to trace CPU events when profiling using Nsight tools. \n",
    "\n",
    "Before we begin, let us execute the below cell to display information about the NVIDIA® CUDA® driver and the GPUs running on the server by running the `nvidia-smi` command. To do this, execute the cell block below by clicking on it with your mouse, and pressing Ctrl+Enter, or pressing the play button in the toolbar above. You should see some output returned below the grey cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NVIDIA Profiler\n",
    "\n",
    "### What is profiling\n",
    "Profiling is the first step in optimizing and tuning your application. Profiling an application would help us understand where most of the execution time is spent. You will gain an understanding of its performance characteristics and can easily identify parts of the code that present opportunities for improvement. Finding hotspots and bottlenecks in your application, can help you decide where to focus our optimization efforts.\n",
    "\n",
    "### NVIDIA Nsight Tools\n",
    "NVIDIA offers Nsight tools (Nsight Systems, Nsight Compute, Nsight Graphics), a collection of applications which enable developers to debug, profile the performance of CUDA, OpenACC, or OpenMP applications. \n",
    "\n",
    "Your profiling workflow will change to reflect the individual Nsight tools. Start with Nsight Systems to get a system-level overview of the workload and eliminate any system level bottlenecks, such as unnecessary thread synchronization or data movement, and improve the system level parallelism of your algorithms. Once you have done that, then proceed to Nsight Compute or Nsight Graphics to optimize the most significant CUDA kernels or graphics workloads, respectively. Periodically return to Nsight Systems to ensure that you remain focused on the largest bottleneck. Otherwise the bottleneck may have shifted and your kernel level optimizations may not achieve as high of an improvement as expected.\n",
    "\n",
    "- **Nsight Systems** analyze application algorithm system-wide\n",
    "- **Nsight Compute** debug and optimize CUDA kernels \n",
    "- **Nsight Graphics** debug and optimize graphic workloads\n",
    "\n",
    "<img src=\"../images/Nsight Diagram.png\" width=\"80%\" height=\"80%\">\n",
    "*The data flows between the NVIDIA Nsight tools.*\n",
    "\n",
    "In this lab, we only focus on Nsight Systems to get the system-wide actionable insights to eliminate bottlenecks.\n",
    "\n",
    "### Introduction to Nsight Systems \n",
    "Nsight Systems tool offers system-wide performance analysis in order to visualize application’s algorithms, help identify optimization opportunities, and improve the performance of applications running on a system consisting of multiple CPUs and GPUs.\n",
    "\n",
    "#### Nsight Systems Timeline\n",
    "- CPU rows help locating CPU core's idle times. Each row shows how the process' threads utilize the CPU cores.\n",
    "<img src=\"../images/cpu.png\" width=\"80%\" height=\"80%\">\n",
    "\n",
    "- Thread rows shows a detailed view of each thread's activity including OS runtime libraries usage, CUDA API calls, NVTX time ranges and events (if integrated in the application).\n",
    "<img src=\"../images/thread.png\" width=\"80%\" height=\"80%\">\n",
    "\n",
    "- CUDA Workloads rows display Kernel and memory transfer activites. \n",
    "<img src=\"../images/cuda.png\" width=\"80%\" height=\"80%\">\n",
    "\n",
    "### Profiling using command line interface \n",
    "To profile your application, you can either use the Graphical User Interface(GUI) or Command Line Interface (CLI). During this lab, we will profile the mini application using CLI.\n",
    "\n",
    "The Nsight Systems command line interface is named `nsys`. Below is a typical command line invocation:\n",
    "\n",
    "`nsys profile -t openacc,nvtx --stats=true --force-overwrite true -o laplace ./laplace`\n",
    "\n",
    "where command switch options used for this lab are:\n",
    "- `profile` – start a profiling session\n",
    "- `-t`: Selects the APIs to be traced (nvtx and openacc in this example)\n",
    "- `--stats`: if true, it generates summary of statistics after the collection\n",
    "- `--force-overwrite`e: if true, it overwrites the existing generated report\n",
    "- `-o` – name for the intermediate result file, created at the end of the collection (.qdrep filename)\n",
    "\n",
    "**Note**: You do not need to memorize the profiler options. You can always run `nsys --help` or `nsys [specific command] --help` from the command line and use the necessary options or profiler arguments.\n",
    "For more info on Nsight profiler and NVTX, please see the __[Profiler documentation](https://docs.nvidia.com/nsight-systems/)__.\n",
    "\n",
    "### How to view the report\n",
    "<a name=\"gui-report\"></a>\n",
    "When using CLI to profile the application, there are two ways to view the profiler's report. \n",
    "\n",
    "1) On the Terminal using `--stats` option: By using `--stats` switch option, profiling results are displayed on the console terminal after the profiling data is collected.\n",
    "\n",
    "<img src=\"../images/laplas3.png\" width=\"100%\" height=\"100%\">\n",
    "\n",
    "2) NVIDIA Nsight Systems GUI: After the profiling session ends, a `*.qdrep` file will be created. This file can be loaded into Nsight Systems GUI using *File -> Open*. If you would like to view this on your local machine, this requires that the local system has CUDA toolkit installed of same version and the Nsight Systems GUI version should match the CLI version. More details on where to download CUDA toolkit can be found in the “Links and Resources” at the end of this page.\n",
    "\n",
    "To view the profiler report, simply open the file from the GUI (File > Open).\n",
    "\n",
    "<img src=\"../images/nsight_open.png\" width=\"80%\" height=\"80%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using NVIDIA Tools Extension (NVTX) \n",
    "<a name=\"nvtx\"></a>\n",
    "NVIDIA Tools Extension (NVTX) is a C-based Application Programming Interface (API) for annotating events, time ranges and resources in applications. NVTX brings the profiled application’s logic into the Profiler, making the Profiler’s displayed data easier to analyse and enables correlating the displayed data to profiled application’s actions.  \n",
    "\n",
    "During this lab, we profile the application using Nsight Systems command line interface and collect the timeline. We will also be tracing NVTX APIs (already integrated into the application). The NVTX tool is a powerful mechanism that allows users to manually instrument their application. NVIDIA Nsight Systems can then collect the information and present it on the timeline. It is particularly useful for tracing of CPU events and time ranges and greatly improves the timeline's readability. \n",
    "\n",
    "**How to use NVTX**: Add `#include \"nvtx3/nvToolsExt.h\"` in your source code and wrap parts of your code which you want to capture events with calls to the NVTX API functions. For example, try adding `nvtxRangePush(\"main\")` in the beginning of your `main()` function, and `nvtxRangePop(`) just before the return statement in the end.\n",
    "\n",
    "The sample code snippet below shows the use of range events.The resulting NVTX markers can be viewed in Nsight Systems _Timeline View_. \n",
    "\n",
    "```cpp\n",
    "    nvtxRangePushA(\"init\");\n",
    "    initialize(A, Anew, m, n);\n",
    "    nvtxRangePop();\n",
    "\n",
    "    printf(\"Jacobi relaxation Calculation: %d x %d mesh\\n\", n, m);\n",
    "\n",
    "    double st = omp_get_wtime();\n",
    "    int iter = 0;\n",
    "\n",
    "    nvtxRangePushA(\"while\");\n",
    "    while ( error > tol && iter < iter_max )\n",
    "    {\n",
    "        nvtxRangePushA(\"calc\");\n",
    "        error = calcNext(A, Anew, m, n);\n",
    "        nvtxRangePop();\n",
    "\n",
    "        nvtxRangePushA(\"swap\");\n",
    "        swap(A, Anew, m, n);\n",
    "        nvtxRangePop();\n",
    "\n",
    "        if(iter % 100 == 0) printf(\"%5d, %0.6f\\n\", iter, error);\n",
    "\n",
    "        iter++;\n",
    "    }\n",
    "    nvtxRangePop();\n",
    "   \n",
    "```\n",
    "\n",
    "<img src=\"../images/nvtx.PNG\" width=\"80%\" height=\"80%\">\n",
    "\n",
    "**Using NVTX with Fortran** Being a C API in order to use NVTX in Fortran, wrappers must be written calling C API and exposed as a module. As part of this tutorial the file `nvtx.f90` consists of wrapper sub routines for NVTX API.  \n",
    "\n",
    "Detailed NVTX documentation can be found under the __[CUDA Profiler user guide](https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvtx)__."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Steps to follow\n",
    "To obtain the best performance from GPU and utilize the hardware, one should follow the cyclical process (analyze, parallelize, optimize). \n",
    "\n",
    "- **Analyze**: In this step, you first identify the portion of your code that includes most of the computation and most of the execution time is spent. From here, you find the hotspots, evaluate the bottlenecks and start investigating GPU acceleration.\n",
    "\n",
    "- **Parallelize**: Now that we have identified the bottlenecks, we use use the techniques to paralellise the routines where most of the time is spent.\n",
    "\n",
    "- **Optimize**:  To further improve the performance, one can implement optimization strategies step by step in an iterative process including: identify optimization opportunity, apply and test the optimization method, verify and repeat the process.\n",
    "\n",
    "Note: The above optimization is done incrementally after investigating the profiler output.\n",
    "\n",
    "We will follow the optimization cycle for porting and improving the code performance.\n",
    "\n",
    "<img src=\"../images/Optimization_Cycle.jpg\" width=\"80%\" height=\"80%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started \n",
    "In the following sections, we parallelise and optimize the serial [RDF](../../../nways_MD/English/C/jupyter_notebook/serial/rdf_overview.ipynb) using different approaches to the GPU programming following the above steps. For each section, inspect the code, compile, and profile it. Then, investigate the profiler’s report to identify the bottlenecks and spot the optimization opportunities.  At each step, locate problem areas in the application and make improvements iteratively to increase performance.\n",
    "\n",
    "This lab comprises of multiple exercises, each follows the optimization cycle method. For each exercise, compile the code, validate the output (more instruction in the labs) and profile it. You will profile the code with Nsight Systems (`nsys`), identify certain areas/kernels in the code, where they don't behave as expected. \n",
    "\n",
    "**NOTE**: Example screenshots are for reference only and you may not get identical profiler report."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-----\n",
    "\n",
    "<!--# <div style=\"text-align: center ;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em\">[RETURN](rdf_overview.ipynb)</div>-->\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Links and Resources\n",
    "\n",
    "\n",
    "[NVIDIA Nsight System](https://docs.nvidia.com/nsight-systems/)\n",
    "\n",
    "\n",
    "**NOTE**: To be able to see the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).\n",
    "\n",
    "Don't forget to check out additional [Open Hackathons Resources](https://www.openhackathons.org/s/technical-resources) and join our [OpenACC and Hackathons Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.\n",
    "\n",
    "--- \n",
    "\n",
    "## Licensing \n",
    "\n",
    "Copyright © 2022 OpenACC-Standard.org.  This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
